Realities of Hashing:

An Experimental Comparison of Modern Hashing Techniques

 

William Garrison

University of Maryland Baltimore County

December 1998

 

Minimal perfect hash functions have long been sought after in computer science as an optimal lookup structure for many applications.  The established mindset is that these perfect functions are fast and efficient, but difficult to create.  This experiment compares one implementation of near-minimal perfect hashing with common methods of open addressing and chaining.  It shows that at reasonable load factors, open addressing is nearly as efficient as perfect hashing, and that the costs of chaining may make it an acceptable method for applications where resizing the table is not feasible.  Additionally, a comparison of hash functions shows that newer, more uniformly distributed hash functions are not likely to have a significant payoff over the established division and multiplication methods.  Finally, it proposes modifications to the near-minimal perfect hash algorithm used here to create minimal perfect hash tables for any data set.

The reader is assumed to have a basic understanding of hash tables, and certain base concepts are not explained in detail.

Introduction

Hash tables are a common and important data structure in many computer programs.  They are built for speed, and much time is spent analyzing and optimizing hash tables, hash functions, uniform distribution, and related algorithms and subroutines.  At present, several algorithms are generally accepted as “optimal” for open addressing, and the memory/speed trade-offs are well documented.

One major area of analysis is perfect hash tables.  Two properties defining a minimal perfect hash table (Schmidt[2]) are:

·         It allows keyword recognition in a static search set using at most one probe into the hash table. This represents the "perfect" property.

·         The actual memory allocated to store the keywords is precisely large enough for the keyword set, and no larger. This is the "minimal" property.

They are designed to minimize insertion time by producing a hash function which hashes N elements into N slots (Kalpakis[6]).  Under common conceptions, such perfect algorithms are very difficult and time consuming to create, and not always possible.  In most cases near-minimal perfect hash functions are used instead.  Usually they are applied to static data sets where the resulting table can be computed once and stored rather than generated in real time.  Compilers commonly do this with keywords to optimize compile times (Schmidt[2]).

This document describes a set of experiments which confirm many of the assumptions about open addressing, but find that minimal perfect hashing is not significantly faster than open addressing at equal memory loads.  Additionally, these tests show that the common division and multiplication functions for approximating uniform distribution are nearly perfect, and newer techniques do not result in a significant payoff.  Finally, a modification to the near-minimal perfect hash function is proposed, which should be capable of providing “optimal” hashing for any data set.

Implementation

Open Addressing

Open addressing is the standard hash table implementation.  It resolves collisions by probing for free slots.  Probing for either free slots or matching keys dominates insertion and search time.  The hash formula determines the length and complexity of these probes.  A well behaved hash function will produce probe sequences that span the entire table.  If it does not, the probe will redundantly check the same slot multiple times, and insertions may fail even if room exists in the table.

The hash function determines the distribution of keys in the table.  An ideal hash function exhibits uniform distribution, where slots are picked evenly throughout the table, maximizing speed and table capacity.  Insertion and search times are bounded by the length of the longest probe sequence in the table, while average case performance follows the average probe length.  Hash functions can exhibit primary and secondary clustering problems (Cormen[5] 234).  Primary clustering occurs when large blocks of occupied slots build up from similar probe sequences.  Secondary clustering is the result of identical probe sequences caused by different keys hashing to the same initial slot.  Both types of clustering can severely increase the length of the average probe sequence, but primary clustering is the more significant performance factor.

Three open addressing methods are compared in this experiment: linear, quadratic, and double hashing.  The hash formulas H(key, i) for each of these techniques are shown in the table below.

 

 

Hash Type          Hash function H(key, i)                        Resulting distribution

Linear Hash        (key + C × i) mod N                            1st degree progression through slots

Quadratic Hash     (key + A × i + B × i²) mod N                   2nd degree progression through slots

Double Hash        [key + A × i + B × i × H’(key)] mod N,         Varying progression based on key and i
                   where H’(key) = key mod (N - C)

 

The hash functions used in this experiment are not iterative.  This means that to generate a sequence of values for a given key, the index value i is incremented with each call to H().  An alternative is to pass in the result of the prior H() call and combine the new value with it.  This iterative method may result in a faster hash function by eliminating the need for the i parameter, but requires additional coding.  Experimental profiling shows that the speed of the hash function is only a concern at extremely low load factors.

Linear hashing begins searching at the collision point and steps through the table, incrementing by the constant value C.  So long as C is relatively prime to the table size, the search will span the entire table.  A reasonable choice for C is a large prime constant.  To be completely safe, it can be computed when the table is sized, to guarantee this property.  Linear hashing suffers from significant primary and secondary clustering problems, and is the least effective hashing method.

Quadratic hashing resolves some of the behavior issues present in a linear hash.  It produces more varied probe sequences, which are less likely to result in primary clustering problems.  Secondary clustering still occurs, since the probe sequence is determined solely by the initial slot.

Double hashing is the current standard for efficient open addressing.  It addresses both the primary and secondary clustering problems by using a second hash function which is dependent upon both the index and the key.  This results in a new probe sequence for each key, rather than for each initial hash slot.  The choice of the second hash function is important to the uniform performance of double hashing.  The H’() function shown in the table above is a general guideline for an effective secondary hash function, keeping the step size relatively prime to the table size (Knuth[7]).  This technique results in a more complex hash function, although the distribution advantages outweigh the performance overhead.
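The three probe formulas can be sketched in C++ as follows.  This is a minimal illustration rather than the code used in the experiment: the constants are arbitrary primes, the table size N is assumed to be greater than 1, and intermediate overflow simply wraps, which is harmless for probing.

    const unsigned A = 40503;    // illustrative prime constants
    const unsigned B = 10007;
    const unsigned C = 100003;

    unsigned linearProbe(unsigned key, unsigned i, unsigned N)
    {
        return (key + C * i) % N;                    // 1st degree progression
    }

    unsigned quadraticProbe(unsigned key, unsigned i, unsigned N)
    {
        return (key + A * i + B * i * i) % N;        // 2nd degree progression
    }

    unsigned doubleProbe(unsigned key, unsigned i, unsigned N)
    {
        unsigned hp = key % (N - 1);                 // H'(key) = key mod (N - c), small c
        return (key + A * i + B * i * hp) % N;       // sequence depends on the key itself
    }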

Open addressing performance is relative to the load factor of the table: the number of elements in proportion to the table size.  More free slots result in improved throughput.  If the number of elements in the table exceeds the size of the table, the table must be resized.  This is done by allocating a larger table, rehashing the elements into it, then discarding the old table.  Resizing can be a hindrance where speed is important or memory is limited.

Chaining

An alternative technique to open addressing is chaining.  Each slot in the table is a reference to a secondary data structure for storing elements.  Typically, this structure is a linked list, but can be an array, tree, heap, hash table, or other structure.  When a key value is hashed into a slot in the table, it is appended to the data structure referenced by this slot.  This technique solves the complexities of open addressing by deferring the storage issue to the secondary structure.

Chained search performance is dependent on the time required to search the secondary structure.  In the case of a linked list, this is dependent on the length of the lists.  In the worst case, all elements use the same slot in the table, and one list contains all the elements.  This results in Θ(N) performance (Knuth[7] 224).  With a uniform distribution, the time reflects average case behavior.  In a table containing E elements and N lists, the average length of a list is E / N.  Thus, if the number of elements is the same as the number of lists, the average length is 1, resulting in optimal Θ(1) performance.  Notice that this average length is also equal to the load factor of the table.

Chained insertion is also dependent on the secondary structure.  In the case of a linked list, insertion can be implemented in Θ(1).  This results in insertion times that are unaffected by load factor.  Balanced trees or other more sophisticated structures may have more complex insertion times, but are often faster to insert into than to search.
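A minimal sketch of constant-time chained insertion follows, assuming one singly linked list per slot with insertion at the head.  (The implementation measured in this experiment actually used two doubly linked lists per slot, as described under Load Factors and Memory; this shows only the idea in its simplest form.)

    #include <cstddef>

    struct ChainNode {
        unsigned   key;
        void*      data;
        ChainNode* next;
    };

    struct ChainedTable {
        ChainNode** slots;      // array of N list heads, initially all null
        std::size_t N;
    };

    // One hash computation and one link operation, regardless of load factor.
    void chainedInsert(ChainedTable& t, unsigned key, void* data)
    {
        std::size_t slot = key % t.N;
        t.slots[slot] = new ChainNode{ key, data, t.slots[slot] };
    }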

An interesting effect of chaining is that the table may contain more elements than slots.  This is possible since the secondary data structure can store more than one element.  If the structure used is open ended, a hash table based on chaining could contain an unbounded number of elements in a finite number of slots.  In that case the load factor exceeds 1.0, which will affect performance.  It is also possible to attach a more efficient secondary structure to a chained hash table and obtain performance greater than that of open addressing.  Suppose an open addressing table of size N averages j probes per search.  A chained hash table of N binary search trees would require only about ln j probes per search, and fewer computations per probe since there is no hash function calculation.  The result is better performance.  For the open addressing table to match the performance of the chained table, it would require N² slots.

Since chained tables are usually open ended, they do not suffer from the resizing requirement of open addressing.  Chained tables can get a significant speed boost from being resized, but resizing is not always necessary.

Near-Minimal Perfect Hashing Algorithm

Some implementations of near-minimal perfect hashing do not guarantee a solution without multiple runs.  They randomly or heuristically generate hash functions until one is found with the desired characteristics.  The algorithm used here produces a larger table than such techniques, but results in a valid table with each run.  It is described in detail in Kalpakis[6], “Almost Perfect Hashing.”  The algorithm declares a header table whose rows are used to locate entries in the data table.  Each header entry allows one-step collision resolution using three values (p, i, r):

 

Header Table entry

p    starting position of element(s)
i    parameter to hash function
r    length of run of elements

The p value is the starting position of the elements referred to by this header entry.  One header entry may correspond to multiple keys.  The r parameter indicates the number of elements in this collision set.  When a header entry refers to only one key, r is 1.  For collisions, r > 1.  The i value is the hash function parameter to H(key, i, r) needed to find the offset from p for the element.

Searches are very simple in perfect hashing.  Let s be the size of the header table and k be the value of the key.  First, compute the header slot x for the key as x = k mod s.  Next, compute the data table slot y = p + H(k, i, r).  The H() function returns a value between 0 and r-1, which is the offset from p of the element.  If the element in the data table at slot y matches the key, the value is found.  If it does not match, the value is not in the table.
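A sketch of this one-probe search is shown below.  The HeaderEntry layout, the convention that r = 0 marks an empty header slot, and the use of the division method for H() are assumptions made for illustration; the (p, i, r) triple and the two-step index computation follow the algorithm description.

    struct HeaderEntry {
        unsigned p;   // starting position of the run in the data table
        unsigned i;   // parameter to the hash function
        unsigned r;   // run length; 0 is taken to mean "no keys hash to this slot"
    };

    // One of the H(key, i, r) functions from the Hash Functions section;
    // the division method with illustrative prime constants is assumed here.
    unsigned H(unsigned key, unsigned i, unsigned r)
    {
        const unsigned A = 40503, B = 10007, C = 100003;
        return (key % (A * i + B * r + C)) % r;
    }

    // Compute the header slot, then index the data table directly.
    // There is no probe loop; one lookup decides presence or absence.
    bool perfectSearch(const HeaderEntry* header, unsigned s,
                       const unsigned* dataKeys, unsigned key)
    {
        const HeaderEntry& h = header[key % s];      // x = k mod s
        if (h.r == 0)
            return false;                            // nothing hashed here
        unsigned y = h.p + H(key, h.i, h.r);         // y = p + H(k, i, r)
        return dataKeys[y] == key;                   // match, or not in the table
    }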

The perfect property of this hash function is that no probes are required to find an element in the table.  The element is indexed directly and found in one lookup.  This makes searches optimal regardless of load factor. Collisions are resolved by using the i parameter to allow one entry in the header table to refer to multiple locations in the data table.  By using the key and the i value, this case still requires only one step, regardless of the length of the collision set.

Insertions are more complex in this hash algorithm.  When an element is being added to the table, an allocation function is used to allocate free space in the data table.  This allocation function is key to the time and space efficiency of the algorithm.  When a single element is being added, the allocation function can return any free block.  During a collision, it must find a contiguous run of free blocks to relocate the entire collision set in addition to the new element.

To insert an element, the header slot must be computed as in the search.  If the header slot is free, a data slot can be allocated and stored in p.  The pair (i, r) becomes (0, 1) to indicate a single element.  On a collision, the r value is incremented to indicate an additional element, and a new i value is computed.  The only requirement on i is that H() return a unique value for each key in the collision set.  The computation of i is trivial, and can be implemented via a random search or by stepping through candidate values.  The value chosen for i is dependent upon the hash function used.  It is not reasonable to compute a value for i any faster than guessing, since doing so would require solving a multiple-variable equation and would be dependent upon the hash function.  This computation is not nearly as significant a factor as the allocation routine.
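The insertion path can be outlined as follows.  This is a rough, simplified sketch rather than the implementation that was measured: the FREE_SLOT sentinel, the error handling, and the helper routines are assumptions, and the HeaderEntry structure and H() function are reused from the search sketch above.

    #include <cstddef>
    #include <vector>

    const unsigned FREE_SLOT = 0xFFFFFFFFu;   // assumed sentinel for an empty data slot

    // Linear scan for `len` contiguous free slots: the Theta(N) allocation step
    // that dominates insertion time.  Returns data.size() on failure.
    unsigned allocateRun(const std::vector<unsigned>& data, unsigned len)
    {
        unsigned start = 0, run = 0;
        for (unsigned j = 0; j < data.size(); ++j) {
            if (data[j] == FREE_SLOT) {
                if (run == 0) start = j;
                if (++run == len) return start;
            } else {
                run = 0;
            }
        }
        return (unsigned)data.size();
    }

    // Step through candidate i values until H(key, i, r) is distinct for every
    // key in the collision set.  As discussed under Hash Functions, some sets
    // can never be separated, so a real implementation should bound this loop.
    unsigned findUniqueI(const std::vector<unsigned>& keys, unsigned r)
    {
        for (unsigned i = 0; ; ++i) {
            std::vector<bool> used(r, false);
            bool ok = true;
            for (std::size_t k = 0; k < keys.size() && ok; ++k) {
                unsigned off = H(keys[k], i, r);
                if (used[off]) ok = false;
                else used[off] = true;
            }
            if (ok) return i;
        }
    }

    bool perfectInsert(HeaderEntry* header, unsigned s,
                       std::vector<unsigned>& data, unsigned key)
    {
        HeaderEntry& h = header[key % s];
        if (h.r == 0) {                              // standard insertion: any free slot
            unsigned p = allocateRun(data, 1);
            if (p == data.size()) return false;
            h.p = p; h.i = 0; h.r = 1;
            data[p] = key;
            return true;
        }
        // Complex insertion: gather the old run, find a new i, allocate a larger
        // contiguous run, and move the whole collision set plus the new key.
        std::vector<unsigned> set(data.begin() + h.p, data.begin() + h.p + h.r);
        set.push_back(key);
        unsigned r = h.r + 1;
        unsigned i = findUniqueI(set, r);
        for (unsigned k = 0; k < h.r; ++k)
            data[h.p + k] = FREE_SLOT;               // release the old run first
        unsigned p = allocateRun(data, r);
        if (p == data.size()) return false;          // recovery omitted in this sketch
        for (std::size_t k = 0; k < set.size(); ++k)
            data[p + H(set[k], i, r)] = set[k];
        h.p = p; h.i = i; h.r = r;
        return true;
    }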

Perfect hash functions can only hold as many elements as there are slots in the data table.  Due to collisions, the header table may have multiple keys per header slot.  Thus, space is likely to run out in the data table before the header table, since the data table requires one slot per key.  In addition, fragmentation of the data table can make insertions fail, even given sufficient space.

Asymptotic analysis of running time for insertions in the near-minimal perfect hash table:

·         Standard insertion (65% of insertions):

                        Lookup slot in header table:              Θ(1)

                        Find free space in data table:            Θ(N)

·         Complex insertion (35% of insertions):

                        Find unique i (the m value):              Θ(R)

                        Find free space in data table:            Θ(N)

                        Moving entries:                           Θ(R)

                         

Experiments yield an average R = 2.2.  Thus the Θ(N) allocation step dominates insertion time regardless of which case occurs.  This means complex insertions will not be the significant time factor, and collision rate will be less important than table size.  It may even occur that larger data table sizes produce poorer speed results, since the allocation scan grows with the table.

 

Running time for insertion:     O(N)

Running time for search:        O(1)

The above analysis assumes an O(N) running time for the allocation function, which dominates the insertion time.  If the allocation function were implemented using a linked list of free slots, the worst case running time would remain the same, but the average allocation time would become a constant factor.  This is likely to be the most important optimization.

Hash Functions

To test the effectiveness of varying the hash function, four different hash functions were created for H(key, i, r):

Division Method:              (key mod (A·i + B·r + C)) mod r

Multiplication Method:        fraction(key · (A·r + B·i + C)) × r

Reciprocal Method:            ⌊A / (B·key + C·i)⌋ mod r

Random Method:                srand(key + A·r), Σ (j = 0 … i) random()

 

Integers are assumed to be unsigned values ranging from 0 to MAXINT. The value of MAXINT is platform specific and irrelevant.  The A, B, and C constants are chosen to be “as prime as possible” with respect to the table size (Smith[1]).  Ideally, these values should be chosen when the table is created or resized, so as to assure this property.  For this experiment, large prime constants were used.  The value of key is computed from the data to be hashed, and is assumed to be a random integer. The i value is an integer index used for computing sequential hash values.  The r value is an integral range for the hash function.  H(key, i, r) returns values in the range 0 to r-1.
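In code, the four variants might look like the sketches below.  These are hedged illustrations using arbitrary prime constants: the “fraction” in the multiplication method is read here as the overflowed part of the 32-bit product divided by 2^32, and the reciprocal constant is promoted to a 64-bit value on the order of MAXINT², for the reasons discussed below.  None of this is the exact code used in the experiment.

    #include <cstdlib>

    const unsigned A = 40503, B = 10007, C = 100003;   // illustrative primes

    // Division method: (key mod (A·i + B·r + C)) mod r
    unsigned divisionH(unsigned key, unsigned i, unsigned r)
    {
        return (key % (A * i + B * r + C)) % r;
    }

    // Multiplication method: fraction(key · (A·r + B·i + C)) × r, reading the
    // "fraction" as (32-bit product) / 2^32 -- one plausible integer rendering.
    unsigned multiplicationH(unsigned key, unsigned i, unsigned r)
    {
        unsigned prod = key * (A * r + B * i + C);                // wraps mod 2^32
        return (unsigned)(((unsigned long long)prod * r) >> 32);  // scale fraction by r
    }

    // Reciprocal method: floor(A / (B·key + C·i)) mod r, with A near MAXINT^2
    // so the quotient keeps a useful range (see the discussion below).
    unsigned reciprocalH(unsigned key, unsigned i, unsigned r)
    {
        const unsigned long long BIG_A = 0xFFFFFFFFFFFFFFFFull;
        unsigned long long denom =
            (unsigned long long)B * key + (unsigned long long)C * i + 1;  // +1 guards key = i = 0
        return (unsigned)((BIG_A / denom) % r);
    }

    // Random method: seed with key + A·r, then sum i + 1 calls to the system
    // generator and reduce mod r.  Results depend on the platform's rand().
    unsigned randomH(unsigned key, unsigned i, unsigned r)
    {
        std::srand(key + A * r);
        unsigned sum = 0;
        for (unsigned j = 0; j <= i; ++j)
            sum += (unsigned)std::rand();
        return sum % r;
    }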

The division and multiplication methods are the most common and effective hash functions.  The test results confirmed their effectiveness.  They resulted in excellent distribution, and were simpler to code than the reciprocal method.  The reciprocal method introduces complexities, and is not necessarily effective in the general case.  For this function, the choice of A is crucial.  Suppose we have the reciprocal function:

f(x) = ⌊N / x⌋

where N is an integer constant and x ranges from 1 to N.  There are at most 2 × ⌊√N⌋ - 1 possible values for f(x), because many values of x produce the same quotient.  For example, with N = 100, every x from 51 to 100 gives f(x) = 1, and the 100 inputs collapse into at most 2 × 10 - 1 = 19 distinct outputs.  One solution would be to apply the fraction() function here and keep only the fractional part, turning this into a variation of the multiplication method.  Alternatively, we can maximize the range of f(x).  To do this, x should never exceed √N.  Consequently, to preserve the effectiveness of the reciprocal hash function, the constant A must be on the order of MAXINT².  This restores the number of possible values back to the full range of integers.  Compensating for this limitation increases the range and magnitude of the values involved, which may present implementation concerns such as speed and accuracy.

The advantage of the reciprocal method is that it provides a nonlinear progression of values.  By placing key, i, or r in the denominator, the result before applying the modulus is not a line but a curve, which is more desirable for producing a random distribution of keys.

The fourth method is the random hash.  The concept is to use the system random number generator to provide the hash values.  Most random number generators are based on exponential functions, which make consistent and reliable hash functions (Smith[1]).  Note that the results for this hash function may vary widely across systems, and some implementations may not provide identical results for identical seeds.  The random function is two dimensional: along one dimension is the set of seed values, and along the other is the number of iterations of the random call.  For this test, the key and r values were used to seed the generator, while i was used to iterate through calls to the random function.  This produced reliable and efficient results, on par with the division method.

In minimal perfect hash tables, the hash function becomes even more crucial than in open addressing.  During an insertion which results in a collision, the algorithm must search for a new and unique set of values to rehash the data.  The algorithm is likely to be incapable of determining when such a value set cannot be generated, and may fall into an infinite loop.  This was experienced many times while implementing the hash functions.  One particular case arises in the division method.  Suppose a collision occurs between two keys of matching parity.  In this case, we must find an i for which H(key1, i, 2) and H(key2, i, 2) return distinct values.  Depending on the constants A, B, and C in H(), this may not be possible.  Examine the division method equation below:

 

 

Section:          1                2                3

             ( key mod    (A·i + B·r + C) )     mod r

 

If the value of section 2 is greater than key1 and key2, the first modulus has no effect: the result of sections 1 and 2 is simply the original key.  Since the value of r is 2, the division method equation then returns the parity of the key.  If the parity of the two keys is the same, the hash function return values will always be the same.  This is an unlikely problem which can be solved by assuring that keys have some minimum value, and that the expression in section 2 stays in the range 0 to key.  Note that this problem does not occur only with r = 2, but it is most common in that case.
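As a concrete (and purely hypothetical) illustration: suppose keys 14 and 22 collide in one header slot, so r = 2, and suppose the constants behave like the illustrative primes in the sketches above, making A·i + B·r + C at least 120,017 for every i.  The inner modulus then never alters either key, so H collapses to key mod 2, which is 0 for both keys no matter which i is tried; the search for a separating i can never terminate.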

To minimize this problem, this implementation begins the guess for i (the m value) at one and proceeds incrementally.  Attempts to use random values resulted in too many guesses larger than the keys.  Several alternatives exist to solve this problem.  Using m = random() mod key for the guess is a possibility, although we must use the lowest key in the set for this to be guaranteed.  Adding the i constant to the H() function would also quickly and easily prevent this.  Another possibility is to modify the division method equation, resulting in:

(key mod ((A·i + B·r + C) mod key)) mod r.

Here, the inner expression is modified to range from 0 to key.  This would slow computation, and include a circular dependency on the key, which could result in additional anomalies.

Experimental results show that the largest number of guesses for the m value occurs during the initial collision of two keys.  Fortunately, this is also the fastest case, so additional guesses of m are quickly generated.

Load Factors

A key factor in the speed, efficiency, and success of a hash table is the load factor: the ratio of elements to slots.  A minimal hash table has a load factor of 1.0.  In reality, a hash table will seldom function in this case.  For open addressing to work with a load factor of 1.0, the probe sequence must visit every slot before revisiting a slot already examined.  This occurs only if the probe increment is relatively prime to the table size.  The testing results show that this did not always happen, nor should it be expected to.  In the case of perfect hashing, the slots are allocated rather than computed by a probe sequence.  Thus, the allocation routine must maintain contiguous free slots.  If it is successful at this, the data will fit; otherwise slack space is required.

Collision rates and probe lengths vary with load factor.  Collisions occur less often if the keys are spread throughout a larger table.  Probe lengths during insertions and searches also shrink as table size increases.  As collisions and probe lengths decrease, insertion and search times decrease.  The effect varies between types of hash tables.

For open addressing, probe lengths affect the time required to find a free slot.  With more free slots, probes are shorter and data is inserted faster.  During searches for data in the table, it is likely to be found sooner if probes are shorter.  If the data is not in the table, shorter probes result in finding a NULL entry faster, which indicates failure on a search.  When average probe length nears 1, the table becomes optimally fast.  However, this case requires extremely low load factors.

For perfect hashing, load factor has a similar effect, but not due to probe lengths.  Probe length does not apply since free slots are allocated not probed.  Instead, larger collision sets will result in increasing the complexity of computing the m value, and increasing the number of elements reallocated and moved.  The result is a proportional effect for a different reason.  Search times should remain unaffected, requiring only one lookup for each element.

For chaining, elements are appended to the end of a linked list.  If a pointer to the end of the list is stored, this is a single step regardless of the length of the list.  Probe length does affect the size of the linked list that must be searched, causing an increase in search time proportional to that experienced by open addressing.  Chaining requires only a single pointer dereference to find the next slot to check, rather than computation of a hash function.  For this reason, chaining is not affected by load factor nearly as heavily as the other methods.

In all hash tables, once collision rate nears 0%, and probe length nears 1, load factor no longer affects speed.

Allocation of free space in perfect hashing is unique.  Since there is both a data table and a header table, the question arises of which table will benefit from additional space.  The answer is both, but for different reasons.  Increasing the size of the header table results in fewer collisions and smaller collision sets, just as it does in open addressing.  Increasing the size of the data table helps solve the contiguous allocation problems; however, space beyond what is necessary for successful insertion (approximately 1% slack) has no effect.

Load Factors and Memory

Load factor is an important consideration when instantiating a hash table.  Lower load factors result in higher speed, but increased memory costs.  Each hash table has an associated cost per element:

 

N = table size
s = header table size
E = number of elements
k = linked-list overhead

Hash table           Memory use               Elements stored

Open Addressing      2N + Data                key, data pointer

Perfect Hashing      3s + 2N + Data           p, i, r / key, data pointer

Chaining             2kN + 2 × 3E + Data      prev, next, element

 

The open addressing hash table requires a slot for a key and a slot for a data pointer for each slot allocated.  In some real-world implementations, storing the actual data in the table will be unnecessary, and only the key is required.  Perfect hashing requires 3 elements in the header table for each header slot, in addition to the requirements of the open addressing table.

The memory requirement for chaining is based primarily on the number of elements, not the pre-allocated size.  Chaining can vary greatly based on implementation.  The version used in this project uses two doubly linked lists per slot: one for keys and one for data elements.  Each list has a small overhead k, and each entry in a linked list requires a previous and next pointer as well as the element being stored.  This is a very inefficient structure, since a singly linked list would suffice, and a single node containing both the data pointer and the key would save the cost of pointers and overhead.

Perfect hashing requires 2.5x the memory of open addressing if data pointers are included (5 memory elements versus 2) and 4x the memory if they are not (4 memory elements to 1).  Chaining requires even more than perfect hashing when it is highly loaded.  When memory is a concern, it may be most appropriate to compare tables by memory usage rather than by load factor.  Thus, when examining the graphs, compare open addressing at roughly twice the table size of perfect hashing to compare like memory use.

With perfect hashing, the sizes of the header and data tables can be considered independently.  Increasing the size of the header table reduces collisions.  Increasing the data table size reduces problems caused by non-contiguous allocation.  To allow for additional elements, both should be increased.  It is possible to create a header table with fewer entries than the data table.  This may be advantageous since the header table occupies more memory, and experimentation shows that collisions are not a significant performance factor.  It may be possible to create a perfect hash which uses memory as efficiently as open addressing by significantly decreasing the value of s.  Further experimentation would be necessary to draw meaningful conclusions.

In cases where the data elements are particularly large, the memory requirements of the hash table may become insignificant in proportion to the data.  In these cases, operating with a large table size may not be a measurable expense.

Experimentation

The tests were conducted with each hashing method run at varying load factors.  The tables were tested at sizes of 25%, 50%, 100%, 101%, 150%, 200%, 400%, and 1000% of the minimal size.  Load factors can be computed as 1 / table size factor (200% table size = 1/2 = 50% load factor).  Note that the perfect and quadratic hash methods failed to resolve all of the elements at table sizes below 101%.  Failed insertions resulted in extremely large insertion times.  Open addressing and perfect hashing require at least N slots for N elements, thus chaining is the only method which was tested with 25% and 50% table sizes.

Timing was computed using a system timer reporting accuracy to 1 ms, although with resolution closer to 5 ms.  This is sufficient for the testing conducted here.  Each test was run at least 5 times, some more.  All variances are less than 10% of the average reported timing.  Reported times include only the time required for the hash insertions and searches.  Overhead such as computing keys from string values, loading the data set, and instantiating the classes is not included.  The timing procedure is illustrated below:

·                     Load entire data set into array, compute and store keys while loading.

·                     Instantiate hash table

·                     Begin timing

                        Insert elements from array

                        Insertion time = time()

                        Search for each element from array

                        Search time = time() - Insertion time

·                     End timing
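A sketch of this harness is shown below.  It assumes a table type exposing insert() and search() members (the real classes differ) and uses clock() in place of the millisecond system timer.

    #include <cstdio>
    #include <ctime>

    // Times only the insert and search passes; key generation and table
    // construction happen before timing starts, as described above.
    template <class Table>
    void timeTable(Table& table, const unsigned* keys, unsigned count)
    {
        std::clock_t start = std::clock();
        for (unsigned k = 0; k < count; ++k)
            table.insert(keys[k]);
        std::clock_t afterInsert = std::clock();
        for (unsigned k = 0; k < count; ++k)
            table.search(keys[k]);
        std::clock_t end = std::clock();

        std::printf("insert: %ld ms   search: %ld ms\n",
                    (long)((afterInsert - start) * 1000 / CLOCKS_PER_SEC),
                    (long)((end - afterInsert) * 1000 / CLOCKS_PER_SEC));
    }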

 

The data was generated from a recursive directory listing of a hard drive, and resulted in 15,338 unique elements.  String lengths ranged from 5 to 118 characters, and unique keys were successfully generated for all strings.  Many of these elements are very similar, differing only in the last character.  This is due to matching paths and similar names, and is likely to cause higher than average collision rates due to similar key values.

An additional test was performed using the GNU GPERF program on this data set, and on a smaller 505-element data set.  Unfortunately, these tests were not successful, due both to poor key generation from the strings and to the methodology of GPERF, which does not guarantee a solution for every data set.

Observations

Insertion Time

The graph “Insertion Times” shows the average time to insert the entire data set using each hash table, with progressively higher table sizes (decreasing load factors).  When examining relative speed, note that the scale is logarithmic to fit the high variance of times on a single graph.  As expected, load factor is the most significant speed determinant.

The simplest observation concerns chaining times.  This hash table did not vary its timing significantly with load factor.  This is logical since the insertion time is Θ(1) regardless of all other factors.  The only notable timing change occurs once the table becomes so large that its memory requirements exceed available physical RAM.  This condition should never occur in practice, since creating a chained table with so many slots provides no advantage.  The tolerance for high load factors holds even with loads >1.0, where the table is smaller than the data set.  Additional testing shows that timings only begin to increase once the average list length approaches 100!  Despite this consistency, chained insertions were significantly slower than open addressing at table sizes over about 1.5x.

Open addressing times decreased as table size increased (load factor decreased).  This is due to decreased collisions and decreased probe length.  The speed gains for open addressing diminish as the table size factor approaches 1.5x.  This indicates that 50% slack space is sufficient to remove the most significant probes, and that value could be a reasonable guideline for instantiating new tables or determining when to resize.  Above a factor of 2.0x, linear hashing becomes the fastest.  Once the probe lengths become small, the complexity of the hash function becomes the dominant factor, and simplicity becomes efficiency.

Linear hashing saw the largest gains from increased table size, due to the extremely long probe lengths caused by its high primary and secondary clustering.  Double and quadratic hash functions show the best performance at high load.  Once the probe lengths become short, the speed of the hash function becomes the dominant measure, and linear hashing becomes more efficient than double or quadratic hashing.  At high load factors, double and quadratic hashing resulted in similar performance.

The slowest insertion times came from the perfect hash table.  It is also the only table whose speed continued to improve as table size increased.  To explain the extreme speed difference and the large dependence on load factor, the insertion process was profiled in detail.  Results showed that 90% of the time spent in perfect hash insertion was the allocation of free slots in the data table.  The graph shows an additional entry named Perfect*, which is the timing of the perfect hash table with the allocation time excluded.  This result shows Perfect* insertion times nearly independent of load factor, with speeds comparable to open addressing.  Speeds were faster than open addressing at high load factors.

This Perfect* hash table is more efficient than open addressing since the collision resolution process does not involve probing.  The remaining time is the computation of the new hash function for a collision, and the moving of data elements.  These factors are not as significant as probe length.  This theoretical Perfect* hash function is likely to be attainable.  The allocation function used in this test is an optimized linear search for free slots.  It could be more efficiently implemented as a linked list of available slots, lowering the allocation time from Θ(N) to a nearly constant factor, dependent upon fragmentation level rather than the number of slots.
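A sketch of that optimization follows: the free data-table slots are kept on a linked free list so that single-slot allocation becomes constant time.  Finding a contiguous run for a relocated collision set would still need additional bookkeeping; only the single-element path is shown, and the structure is an assumption rather than part of the tested code.

    #include <vector>

    struct FreeList {
        std::vector<int> next;   // next[j] = following free slot, -1 = end of list
        int head;                // first free slot, -1 when the table is full

        explicit FreeList(int N) : next(N), head(N > 0 ? 0 : -1)
        {
            for (int j = 0; j < N; ++j)
                next[j] = (j + 1 < N) ? j + 1 : -1;   // initially every slot is free
        }

        int allocate()            // constant time, replacing the Theta(N) scan
        {
            if (head < 0) return -1;
            int slot = head;
            head = next[slot];
            return slot;
        }

        void release(int slot)    // constant time; used when a run is relocated
        {
            next[slot] = head;
            head = slot;
        }
    };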

The graph “Collisions” shows the total number of collisions during the insertion process.  It indicates that collision rate decreased as table size increased, and in equal proportion for all open addressing methods.  This means that the speed variance amongst open addressing methods is not caused by collision rate.  Chaining and perfect hashing had lower collision rates since their collisions do not consume additional slots; a colliding element is merely appended to an existing chain or run.

The “Total Probes” graph shows the number of probes during the insertions and search processes, and is similar to the average probe length.  This graph is highly proportional to open addressing speed.  It shows that the probe lengths for a linear hash are significantly higher than the lengths for other methods.  It also confirms that the convergence of speeds is due to convergence of probe lengths, and that the number of probes does not change significantly beyond table sizes of 1.5x.

Once total collisions and probe lengths match, the remaining time factor is the computation of the hash function.  This explains why linear hashing is the fastest at a small load factor. This speed advantage is not asymptotically significant, and should not be the focus of development or optimization.  It will only provide a constant factor speed increase, and only in trivial situations where the table is not loaded.

Search Time

The “Search Times” graph shows average time to search for every element in the data set using each hash table, and with progressively larger table sizes.  The scale is similar to that of the insertion time, showing a wide range of times.  Search times are generally similar to insertion times.

Chained search performance follows the results from the insertion times.  Times remained consistent regardless of load factor, with some minor speed gain from shorter lists at higher table sizes.  Once the table size exceeded physical memory, timings increased significantly.  It is interesting that list length did not otherwise affect search time for chaining.  This indicates that list lengths maintained uniformity throughout all tests, and that the linked list structure can search large numbers of elements more efficiently and consistently than a hash probe.

For open addressing, search time and insertion time are both proportional to probe length, so the results are similar.  At high load factors, linear hashing showed the slowest timings while quadratic and double hashing resulted in similar, faster times.  Beyond table sizes of 1.5x, linear hashing showed the best performance due to the increased weighting on the speed of the hash function.  All times did increase slightly once table sizes approached 10x.  This is likely due to hardware effects such as reduced cache efficiency.

Perfect hashing exhibited consistent timings regardless of table size, which is the intended effect of perfect hashing.  Search times increased slightly at very large table sizes.  Since perfect hashing uses more memory than open addressing, it is likely to experience cache efficiency problems sooner than open addressing.  Due to this effect, linear hashing times were faster than perfect hashing at large sizes, although this difference is not asymptotically significant.

At the highest table size, the average linear hash probe length is 1.05 probes per search (16,146 probes ÷ 15,338 searches).  Since the probe length is essentially 1, the linear hash rarely needs to compute its hash function to step to the next slot.  However, the perfect hashing algorithm still calls its hash function, even if there is only one element referenced by the slot.  A simple optimization would eliminate the hash computation when the r value in the header table is 1.  In this case, the data table slot can be directly referenced by the p value in the header table.  Given the complexity of computing the perfect hash function, this optimization would likely eliminate the speed advantage of linear hashing.  Regardless of this optimization, the asymptotic running time of perfect hashing is equal to or better than that of linear hashing.
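The suggested shortcut amounts to a small guard around the hash computation.  A sketch, reusing the HeaderEntry structure and H() function from the earlier search sketch:

    // When a header entry covers a single element (r == 1), p already points at
    // the data slot, so the hash computation can be skipped entirely.
    unsigned dataSlotFor(const HeaderEntry& h, unsigned key)
    {
        return (h.r == 1) ? h.p : h.p + H(key, h.i, h.r);
    }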

Variation of Hash Function

Several graphs titled “Perfect Hash - Varying Hash Functions” show the results of using different hash functions with the perfect hash.  In general, the graphs show very little change in performance or performance-related statistics when varying the hash function.

The graph subtitled “Insertion Times” shows the average insertion time using each hash function H(), dependent upon load factor.  The results show that the de facto standard division method performs well in all cases.  The random hash method actually edges out the division method slightly, which shows the potential of a complex exponential-based algorithm.  The complexity of such a function had little negative effect on speed, but provided a better uniform distribution and thus a better time.  This can be considered a near perfect distribution, and the graph shows how closely the existing standard hash methods come to approximating it.

The graphs subtitled “Collisions” and “Moves” show no change resulting from varying the hash function.

The “Guesses” graph shows the number of attempts to find a new hash function for a collision set during the insertion.  In theory, a more uniform hash function would result in a greater spread of hash results, and a simpler search for unique values.  The resulting variance is actually very minor, and shows that the standard hash functions have equally good uniformity.  Additionally, the results of this graph do not correlate with the “Insertion Times” graph, showing that the number of guesses is not a significant component of speed.

Conclusions

A Solution To Minimal Perfect Hashing?

Conventional wisdom is that minimal perfect hashing is not reasonably possible due to the complexity of finding a hash function resulting in no collisions.  Although this implementation did not achieve total perfection, it is possible to modify it to achieve this for nearly any data set.  The key lies in the allocation of free slots in the data table.

The data table can become non-contiguous when a collision occurs.  The elements involved in the collision must be re-allocated to slots which have enough contiguous space for all the elements in the collision.  Unless the elements involved in the collision were allocated immediately before the collision, they will already have elements after them, and the space will not be contiguous.

 

 

Before insertion (hash table, s = 10):

        key     index   run length
        27        0         1          Single entry
        38        0         1          Single entry
        42        0         1          Single entry
        93        0         1          Single entry
        474       0         1          Single entry
        Free
        Free

        Insert element key = 852: collision with element key = 42.

After insertion (hash table, s = 10):

        key     index   run length
        27        0         1          Single entry
        38        0         1          Single entry
        Free                           Reallocated
        93        0         1          Single entry
        474       0         1          Single entry
        42        0         2          Moved
        852       1         2          Inserted

        Result of insertion, collision, reallocation, and move.

 

 

The above illustration shows the creation of a non-contiguous section.  Should a new element be inserted that collided with keys 42 and 852, there would be insufficient contiguous space to place the new element, even though a free slot exists in the table.  The allocation function must be modified to recontiguate the free space by re-arranging the table elements.

 

 

Recontiguated space (hash table, s = 10):

        key     index   run length
        27        0         1          Single entry
        38        0         1          Single entry
        93        0         1          Single entry
        474       0         1          Single entry
        42        0         2          Run of two
        852       1         2
        Free

After recontiguation, an element colliding with 42 or 852 could be successfully inserted.

Test results show that only 1% slack space is required for the existing allocation function to create a near-minimal perfect hash table.  Even with a high collision rate on the data set used, the longest collision run was 7 elements in a data set of roughly 15,000.  This indicates that very little recontiguation would be required to fit the data into the table.  However, given the small amount of slack space required, the coding complexity of this new allocation function, and the run-time cost of recontiguating a data set, it may not be worth creating.  On a data set with a 90% collision rate it might be the only viable way to obtain reasonable search times, but such a data set is unlikely, and in such a case the programmer should probably look at the key generator instead.

An alternate approach may be possible based on sorting the data before insertion.  The header slot which an element uses is key mod s.  Let us define the cardinality of a data element to be its header table slot, key mod s, and sort the data set by this function.  This results in a data set ordered by header table slot.  Since keys with matching header slots are adjacent, collisions occur immediately.  This means that allocation of contiguous slots can be done without creating holes, because only the most recently inserted run can still receive collisions.  Allocation is also simplified, since there is no need to go back and scan for free slots.

 

 

Sorted insertion (hash table, s = 10):

        key     index   run length
        42        0         2          Inserted first since 42 % 10 = 2
        852       1         2          Inserted next since 852 % 10 = 2
        93        0         1          Single entry
        474       0         1          Single entry
        27        0         1          Single entry
        38        0         1          Single entry
        Free

 

 

In this example, element 852 is inserted immediately following element 42, not two slots later, so the contiguous space problem never occurs.  Implementing this method would be far simpler than the modified allocation method.  Running time would include the cost of a sort, typically N log N.  It would be difficult to compare this to the method of modifying the allocator, since contiguous allocation is statistical and running times are difficult to compute.  A reasonable extension to this experiment would be to implement and compare both.
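One way the sorted construction might be realized is sketched below, reusing the HeaderEntry structure, H(), and findUniqueI() from the earlier sketches.  The grouping pass and the assumption that the data table has at least as many slots as there are keys are illustrative choices, not the author's implementation.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Build the table in one pass over keys sorted by header slot (key mod s).
    // Each collision set arrives as a consecutive group, so runs are laid out
    // left to right and the allocator never has to hunt for contiguous space.
    void buildSorted(HeaderEntry* header, unsigned s,
                     std::vector<unsigned>& data, std::vector<unsigned> keys)
    {
        std::sort(keys.begin(), keys.end(),
                  [s](unsigned a, unsigned b) { return a % s < b % s; });

        unsigned nextSlot = 0;                      // next unused data-table slot
        for (std::size_t k = 0; k < keys.size(); ) {
            std::size_t start = k;
            unsigned slot = keys[k] % s;
            while (k < keys.size() && keys[k] % s == slot)
                ++k;                                // gather the whole collision set

            std::vector<unsigned> set(keys.begin() + start, keys.begin() + k);
            unsigned r = (unsigned)set.size();
            unsigned i = findUniqueI(set, r);       // as in the insertion sketch
            for (unsigned j = 0; j < r; ++j)
                data[nextSlot + H(set[j], i, r)] = set[j];
            header[slot].p = nextSlot;
            header[slot].i = i;
            header[slot].r = r;
            nextSlot += r;                          // allocation never leaves holes
        }
    }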

These two solutions address the issue of unneeded slack space at the end of the table, but they do not completely solve the problem of minimality.  The table still requires a 2x to 4x overhead for the header table, which cannot be eliminated. Comparing this type of minimal perfect hashing with our earlier definition, we see that it fails the second requirement: the minimal property.  To distinguish this memory optimization from a minimal-perfect hash, this will be regarded as optimal perfect hashing, where the only additional memory cost is a fixed constant for required overhead, and not unused slots.

To compare minimal hashing with open addressing based on memory use, we must look at different positions on the scale.  Comparing minimal hashing at a table size of 1x with open addressing at 2x to 4x, we see that with equal amounts of memory, open addressing is faster in both insertion and search.

Timing

The results of this experiment confirm some of the standard assumptions about open addressing, but suggest that much of the research on related topics, such as hash function speed and uniform distribution, has limited practical payoff.

·         Load factor is the most significant determinant of speed in open addressing

·         Speed of hash functions is not asymptotically important

·         Uniform distribution is achievable using standard hash functions

 

Chaining is shown to be a viable alternative to open addressing and perfect hashing when the size of the data set is unknown and table resizing is a significant problem.

 

In regards to perfect hashing, this experiment shows that:

·         Near-minimal perfect hashing is not faster than open addressing at equal memory loads

·         A subset of minimal perfect hashing, optimal perfect hashing, is possible but has little value

Summary

Hash table research continues from the need to make more optimal use of computing resources.  Nearly infinite possibilities still exist for chaining, and minimal perfect hashing has yet to be fully explored.  Knowledge of open addressing techniques has likely reached its peak, and further research along this line is not likely to bring significant returns.


Related Resources

The source code, experimental data, and the results are available at http://moby.ml.org/hash_paper.  Code was written and tested using Microsoft Visual C++ V5.0 on a K6-2/233.  Source was compiled in debug mode with no optimizations.  Data and graphs were compiled using Microsoft Works/Windows ’95.

References

1. An Exponential Open Hashing Function Based on Dynamical Systems Theory.  Brad Smith.
                October 1997  <http://www.nmia.com/~monsmith/diss/diss.html>

2. User's Guide to gperf.  Douglas C. Schmidt.
                September 1998  <http://www.mikrus.pw.edu.pl/lds/GNU/gperf/gperf_2.html>

3. A Program That Performs Minimal Perfect Hashing.  Kendall V. Scott.
                <http://softdocwiz.com/MinPerHash.htm>

4. Hash Functions for Hash Table Lookup.  Robert J. Jenkins Jr.
                1995-1997  <http://ourworld.compuserve.com/homepages/bob_jenkins/doobs.htm>

5. Introduction to Algorithms.  Thomas H. Cormen.
                ©1990 MIT Press

6. Almost Perfect Hashing.  Dr. K. Kalpakis, Dr. Alan Sherman.
                November 1998  <http://www.cs.umbc.edu/~kalpakis/441-201/prj.html>

7. The Art of Computer Programming (Vol. 3, 2nd Ed.).  Donald E. Knuth.
                ©1998 Addison-Wesley

 

Special thanks to Dr. Alan Sherman for aid in the creation of this paper, and to both Dr. Sherman and Dr. Kalpakis for the near-minimal perfect hash algorithm.